Skip to content

<fix>[kvm]: defer skip list cleanup on nodeLeft to prevent VM split-brain#3330

Open
ZStack-Robot wants to merge 12 commits into5.5.6from
sync/ye.zou/fix/ZSTAC-80821
Open

<fix>[kvm]: defer skip list cleanup on nodeLeft to prevent VM split-brain#3330
ZStack-Robot wants to merge 12 commits into5.5.6from
sync/ye.zou/fix/ZSTAC-80821

Conversation

@ZStack-Robot
Copy link
Collaborator

Summary

  • ZSTAC-80821: MN 重启时 nodeLeft 直接清空 vmsToSkip,正在启动的 VM 被误判为 Stopped → HA 重复拉起 → 脑裂
  • 改为 orphan 延迟清理机制:nodeLeft 移入 orphanedVmsToSkip (带时间戳),60s 周期清理超时条目(默认10分钟)
  • VmTracer 增加二次确认:orphaned + Starting/Migrating 状态的 VM 跳过判定

Files Changed

  • KvmVmSyncPingTask.java — orphan 集合 + 定时清理
  • VmTracer.java — shouldSkipMissingVmHandling hook
  • KVMGlobalConfig.java — orphanedVmSkipTimeout 配置项

Resolves: ZSTAC-80821

sync from gitlab !9152

@coderabbitai
Copy link

coderabbitai bot commented Feb 12, 2026

Walkthrough

新增与扩展若干同步与追踪逻辑:为 KVM 同步任务添加“孤立虚机跳过”映射与 TTL 管理;在 VM 生命周期关键路径(启动、重启、恢复、从结构创建)后同步 VM 设备地址信息到主机;调整根盘分配尺寸计算以取虚拟/实际尺寸最大值;在 VM 状态机中允许从 Destroying 转为 Stopped;为 ZBS 存储控制器的 HTTP 调用添加可重试的 setTryNext 开关。

Changes

Cohort / File(s) Summary
KVM 同步与孤立 VM 跳过
plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java
新增 orphanedSkipVms 并发映射与 ORPHAN_TTL_MS(10 分钟);添加 cleanupExpiredOrphanedSkipVms() 并在同步开始时调用;nodeLeft() 将 per-node 跳过条目移入孤立映射并记录时间戳;追踪/同步流程和 isVmDoNotNeedToTrace() 扩展以考虑孤立 TTL,包含过期修剪与日志。
VM 设备地址同步
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
新增私有方法 syncVmDevicesAddressInfo(vmUuid),在 InstantStart 启动、常规启动完成、重启、恢复、以及基于新 struct 的创建等路径调用,用于向主机发送 SyncVmDeviceInfoMsg 并记录成功/失败日志。
根盘大小计算调整
header/src/main/java/org/zstack/header/vm/VmInstanceSpec.java
getRootDiskAllocateSize 中,若 rootDiskOffering 为 null,则比较镜像的 virtualSizeactualSize 并返回两者最大值(替代仅用 image size 的行为)。
VM 状态机微调
header/src/main/java/org/zstack/header/vm/VmInstanceState.java
在 Destroying 状态中新增对事件 VmInstanceStateEvent.stopped 的允许迁移,添加从 Destroying 到 Stopped 的转变路径。
ZBS HTTP 调用重试开关
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java
在内部类 HttpCaller<T> 中新增 setTryNext(boolean tryNext) 方法并返回自身;将原来的直接 sync HTTP 调用替换为使用此 HttpCaller 并开启 tryNext,以支持对下一个 MDS 的可重试调用。

Sequence Diagram(s)

(本次修改为若干独立位置的功能增强与状态/逻辑调整,未引入跨 3+ 明确组件的全新顺序控制流,故省略序列图。)

代码审查工作量评估

🎯 3 (Moderate) | ⏱️ ~20 分钟

诗篇

🐰✨ 孤儿虚机静且闲,
十分钟后再探看端,
设备地址轻声诉,
状态机道又回还,
重试呼叫寻新岸。


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Title check ❌ Error 标题超过72字符限制(73个字符),且实际变更文件数量(5个)与标题仅关注KVM相关变更存在不符。 将标题缩短至72字符以内,例如:'[kvm]: defer skip list cleanup to prevent VM split-brain'(62字符)。
Docstring Coverage ⚠️ Warning Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed 描述与变更集相关,详细说明了orphaned延迟清理机制的目的、实现方式和相关文件修改,符合要求。
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into 5.5.6
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync/ye.zou/fix/ZSTAC-80821

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.40.5)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java`:
- Around line 524-526: The protected method isVmOrphaned in KvmVmSyncPingTask is
unused and duplicates logic already present via
orphanedVmsToSkip.containsKey(vmUuid) in isVmDoNotNeedToTrace and
shouldSkipMissingVmHandling; remove the isVmOrphaned method to eliminate
redundancy, or if you intend to keep it for future/semantic reasons, add a clear
comment above isVmOrphaned explaining why it is retained (including planned
usage or TODO ticket reference) and mark it `@SuppressWarnings`("unused") if
desired so linters know it is intentional.
🧹 Nitpick comments (2)
plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java (2)

319-352: 周期任务资源管理问题

startOrphanedVmCleanupTask() 提交了一个周期任务,但在 stop() 方法(第 415-418 行)中没有取消该任务。这可能导致组件停止后任务仍在运行。

建议保存 PeriodicTask 的引用并在 stop() 中取消:

♻️ 建议的修改

在类中添加字段保存任务引用:

private Future<?> orphanedVmCleanupTaskFuture;

修改 startOrphanedVmCleanupTask 保存返回值:

 private void startOrphanedVmCleanupTask() {
-    thdf.submitPeriodicTask(new PeriodicTask() {
+    orphanedVmCleanupTaskFuture = thdf.submitPeriodicTask(new PeriodicTask() {
         // ... existing code
     });
 }

修改 stop() 方法取消任务:

 `@Override`
 public boolean stop() {
+    if (orphanedVmCleanupTaskFuture != null) {
+        orphanedVmCleanupTaskFuture.cancel(false);
+    }
     return true;
 }

328-328: 建议提取魔法数字为常量

周期任务的间隔 60 秒直接硬编码在代码中。建议提取为类常量或考虑使其可配置,以便于维护和调整。

♻️ 建议的修改
+private static final long ORPHANED_VM_CLEANUP_INTERVAL_SECONDS = 60;
+
 private void startOrphanedVmCleanupTask() {
     thdf.submitPeriodicTask(new PeriodicTask() {
         // ...
         `@Override`
         public long getInterval() {
-            return 60; // Check every 60 seconds
+            return ORPHANED_VM_CLEANUP_INTERVAL_SECONDS;
         }
         // ...
     });
 }

Comment on lines 524 to 526
protected boolean isVmOrphaned(String vmUuid) {
return orphanedVmsToSkip.containsKey(vmUuid);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "isVmOrphaned" --type java

Repository: MatheMatrix/zstack

Length of output: 180


🏁 Script executed:

rg "extends KvmVmSyncPingTask" --type java

Repository: MatheMatrix/zstack

Length of output: 44


🏁 Script executed:

sed -n '510,535p' plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java

Repository: MatheMatrix/zstack

Length of output: 1028


移除未使用的 isVmOrphaned 方法

搜索结果确认该 protected 方法在整个代码库中无任何调用,且不存在 KvmVmSyncPingTask 的子类使用该方法。此外,相同逻辑已在 isVmDoNotNeedToTraceshouldSkipMissingVmHandling 方法中直接调用 orphanedVmsToSkip.containsKey(vmUuid),存在冗余。请移除该未使用的方法,或添加注释说明其保留的具体原因(如未来功能规划)。

🤖 Prompt for AI Agents
In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java` around lines
524 - 526, The protected method isVmOrphaned in KvmVmSyncPingTask is unused and
duplicates logic already present via orphanedVmsToSkip.containsKey(vmUuid) in
isVmDoNotNeedToTrace and shouldSkipMissingVmHandling; remove the isVmOrphaned
method to eliminate redundancy, or if you intend to keep it for future/semantic
reasons, add a clear comment above isVmOrphaned explaining why it is retained
(including planned usage or TODO ticket reference) and mark it
`@SuppressWarnings`("unused") if desired so linters know it is intentional.

…isk offering

Resolves: ZSTAC-74683

Change-Id: Id0339ed0221e92e506f60745cde972cc3ee6d9ae
@MatheMatrix MatheMatrix force-pushed the sync/ye.zou/fix/ZSTAC-80821 branch from b3f652d to 9e241fe Compare February 12, 2026 05:29
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java`:
- Around line 493-503: This has a TOCTOU race: after reading orphanedAt from
orphanedSkipVms you unconditionally call orphanedSkipVms.remove(vmUuid), which
can delete a newly inserted timestamp from another thread (e.g., nodeLeft). Fix
by performing an atomic conditional remove using orphanedSkipVms.remove(vmUuid,
orphanedAt) (or re-check the current value and only remove if it equals the
previously read orphanedAt) when ORPHAN_TTL_MS has expired; keep the existing
logging and ensure you reference the same orphanedAt and vmUuid values for the
CAS removal.
- Around line 72-76: The hardcoded ORPHAN_TTL_MS in KvmVmSyncPingTask reduces
operational flexibility; replace it with a KVMGlobalConfig entry (e.g.,
kvm.orphanedVmSkipTimeout) and read that value at startup (and/or subscribe to
updates) instead of using the fixed constant. Update KvmVmSyncPingTask to
retrieve the timeout from the new config key and use that value when managing
orphanedSkipVms/expiry checks (keep a sensible default of 10*60*1000 ms if the
config is missing), and ensure the code references ORPHAN_TTL_MS is replaced
with the config-backed field or accessor so the TTL becomes configurable at
runtime.
🧹 Nitpick comments (1)
plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java (1)

517-523: 建议:使用 CAS 移除以保持一致性

isVmDoNotNeedToTrace 类似,迭代器移除不保证原子性。虽然影响较小,但为保持一致性,建议使用 remove(key, expectedValue)

♻️ 可选改进
         while (it.hasNext()) {
             Map.Entry<String, Long> entry = it.next();
             if (now - entry.getValue() >= ORPHAN_TTL_MS) {
-                it.remove();
+                orphanedSkipVms.remove(entry.getKey(), entry.getValue());
                 logger.info(String.format("cleaned up expired orphaned skip entry for VM[uuid:%s]", entry.getKey()));
             }
         }

Comment on lines +72 to +76
// Orphaned skip entries from departed MN nodes. Key=vmUuid, Value=timestamp when orphaned.
// These VMs remain in skip-trace state for ORPHAN_TTL_MS to avoid false HA triggers
// when a MN restarts and its in-flight VM operations haven't completed yet. See ZSTAC-80821.
private final ConcurrentHashMap<String, Long> orphanedSkipVms = new ConcurrentHashMap<>();
private static final long ORPHAN_TTL_MS = 10 * 60 * 1000; // 10 minutes
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "KVMGlobalConfig.java" -type f

Repository: MatheMatrix/zstack

Length of output: 124


🏁 Script executed:

rg -n "orphanedVmSkipTimeout|orphanedSkipTimeout|ORPHAN.*TIMEOUT" --type java -A 2 -B 2

Repository: MatheMatrix/zstack

Length of output: 552


🏁 Script executed:

rg -n "class KVMGlobalConfig" --type java -A 50

Repository: MatheMatrix/zstack

Length of output: 6922


🏁 Script executed:

cat ./plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java

Repository: MatheMatrix/zstack

Length of output: 9331


🏁 Script executed:

rg -n "ORPHAN_TTL_MS|orphanedVmSkipTimeout|orphanedSkipTimeout" --type java -B 3 -A 3

Repository: MatheMatrix/zstack

Length of output: 4660


建议将硬编码的超时常量改为 KVMGlobalConfig 配置项

当前使用硬编码常量 ORPHAN_TTL_MS = 10 * 60 * 1000 限制了运维灵活性。建议在 KVMGlobalConfig 中添加可配置的超时参数(如 kvm.orphanedVmSkipTimeout),使不同环境可以根据实际需求调整孤立VM的保留时间。

🤖 Prompt for AI Agents
In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java` around lines
72 - 76, The hardcoded ORPHAN_TTL_MS in KvmVmSyncPingTask reduces operational
flexibility; replace it with a KVMGlobalConfig entry (e.g.,
kvm.orphanedVmSkipTimeout) and read that value at startup (and/or subscribe to
updates) instead of using the fixed constant. Update KvmVmSyncPingTask to
retrieve the timeout from the new config key and use that value when managing
orphanedSkipVms/expiry checks (keep a sensible default of 10*60*1000 ms if the
config is missing), and ensure the code references ORPHAN_TTL_MS is replaced
with the config-backed field or accessor so the TTL becomes configurable at
runtime.

AlanJager and others added 11 commits February 12, 2026 13:52
When anti-split-brain check selects a disconnected MDS node, the HTTP
call now times out after 30s instead of 5+ minutes, and automatically
retries the next available MDS via tryNext mechanism.

Resolves: ZSTAC-80595

Change-Id: I1be80f1b70cad1606eb38d1f0078c8f2781e6941
When MN restarts during a destroy operation, the hypervisor may report
the VM as Stopped. Without this transition, the state machine throws
an exception and the VM stays stuck in Destroying state forever.

Resolves: ZSTAC-80620

Change-Id: I037edba70d145a44a88ce0d3573089182fedb162
Resolves: ZSTAC-82153

Change-Id: Ib51c2e21553277416d1a9444be55aca2aa4b2fc4
add syncVmDeviceInfo when vm is running

Resolves: ZSTAC-67275

Change-Id: I746e63786a677676676b6d6265657063666b7677
<fix>[compute]: add syncVmDeviceInfo when vm is running

See merge request zstackio/zstack!9197
<fix>[vm]: add Destroying->Stopped state transition

See merge request zstackio/zstack!9156
<fix>[i18n]: improve snapshot error message for unattached volume

See merge request zstackio/zstack!9192
<fix>[zbs]: reduce mds connect timeout and enable tryNext for volume clients

See merge request zstackio/zstack!9153
<fix>[vm]: use max of virtual and actual size for root disk allocation

See merge request zstackio/zstack!9155
…plit-brain

When a management node departs, its VM skip-trace entries were
immediately removed. If VMs were still being started by kvmagent,
the next VM sync would falsely detect them as Stopped and trigger
HA, causing split-brain.

Fix: transfer departed MN skip-trace entries to an orphaned set with
10-minute TTL instead of immediate deletion. VMs in the orphaned set
remain skip-traced until the TTL expires or they are explicitly
continued, preventing false HA triggers during MN restart scenarios.

Resolves: ZSTAC-80821

Change-Id: I3222e260b2d7b33dc43aba0431ce59a788566b34
…anup

Resolves: ZSTAC-80821

Change-Id: I59284c4e69f5d2ee357b1836b7c243200e30949a
@MatheMatrix MatheMatrix force-pushed the sync/ye.zou/fix/ZSTAC-80821 branch from 614603d to de83221 Compare February 16, 2026 02:23
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java (1)

8476-8496: 建议捕获 hostUuid 本地变量以避免并发下日志/目标不一致
当前回调里使用 self.getHostUuid(),若宿主在异步期间变化,日志与目标可能不一致。

♻️ 可选改进(保持语义不变)
 private void syncVmDevicesAddressInfo(String vmUuid) {
-    if (self.getHostUuid() == null) {
+    String hostUuid = self.getHostUuid();
+    if (hostUuid == null) {
         return;
     }
     SyncVmDeviceInfoMsg msg = new SyncVmDeviceInfoMsg();
     msg.setVmInstanceUuid(vmUuid);
-    msg.setHostUuid(self.getHostUuid());
-    bus.makeTargetServiceIdByResourceUuid(msg, HostConstant.SERVICE_ID, msg.getHostUuid());
+    msg.setHostUuid(hostUuid);
+    bus.makeTargetServiceIdByResourceUuid(msg, HostConstant.SERVICE_ID, hostUuid);
     bus.send(msg, new CloudBusCallBack(msg) {
         `@Override`
         public void run(MessageReply reply) {
             if (!reply.isSuccess()) {
                 logger.warn(String.format("Failed to sync vm device info for vm[uuid:%s], %s",
                     vmUuid, reply.getError()));
             } else {
                 logger.debug(String.format("Sent SyncVmDeviceInfoMsg for vm[uuid:%s] on host[uuid:%s]",
-                    vmUuid, self.getHostUuid()));
+                    vmUuid, hostUuid));
             }
         }
     });
 }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants